## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The distribution of quality looks roughly normal. Most wines are rated 5 or 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution is somewhat right-skewed, with no extreme outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
At first I thought this was simply a normal distribution, but when I looked at the data with a smaller binwidth, it turned out to be bimodal.
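The effect of bin width can be sketched as follows. This is a minimal illustration on simulated data: the `toy` data frame and its two-cluster shape are assumptions, not the wine measurements. It only shows how a narrow `binwidth` in `geom_histogram()` can expose two peaks that a wide one blurs together.

```r
library(ggplot2)

# Simulated stand-in with two clusters (NOT the wine data)
set.seed(42)
toy <- data.frame(value = c(rnorm(500, mean = 0.40, sd = 0.05),
                            rnorm(500, mean = 0.60, sd = 0.05)))

# Wide bins can blur the two clusters into one block
p_wide <- ggplot(toy, aes(x = value)) +
  geom_histogram(binwidth = 0.2)

# Narrow bins make the dip between the two modes visible
p_narrow <- ggplot(toy, aes(x = value)) +
  geom_histogram(binwidth = 0.01)
```

Printing `p_wide` and `p_narrow` one after the other makes the comparison easy to see.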
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
This distribution is right-skewed. I tried transforming the x axis with scale_x_sqrt or scale_x_log10, but neither improved the shape much.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
To take a closer look, I focused on residual.sugar values from 1.0 to 4.0. The distribution is right-skewed and has many outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] 0.0470653
Chlorides are concentrated around 0.08, with a very small standard deviation of 0.047.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Since the original histogram of total sulfur dioxide is right-skewed, I used the scale_x_sqrt function to make the distribution easier to read.
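A minimal sketch of this kind of transformation is shown below. The data are simulated (the `toy` data frame and the `total_so2` column are hypothetical stand-ins, not the wine measurements), so the plot only illustrates how `scale_x_sqrt()` compresses a long right tail.

```r
library(ggplot2)

# Simulated right-skewed stand-in (NOT the wine data)
set.seed(1)
toy <- data.frame(total_so2 = rlnorm(1000, meanlog = 3.6, sdlog = 0.7))

# Histogram on the original scale: long right tail
p_raw <- ggplot(toy, aes(x = total_so2)) +
  geom_histogram(bins = 30)

# Same histogram with a square-root x axis, which compresses the right tail
p_sqrt <- p_raw + scale_x_sqrt()
```

Because the scale is applied before binning, the bins in `p_sqrt` are equally wide on the square-root axis rather than on the original one.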
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## [1] 0.001887334
This variable is approximately normally distributed. One thing I want to keep in mind is that its standard deviation is very small.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Approximately normal distribution. The median and mean are almost equal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution is smoothly right-skewed.
What is the structure of your dataset? There are 1599 wines with 12 variables. I dropped the X column and replaced the integer quality column with a categorical data$quality column.
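The two tidying steps described above might look like the sketch below. It uses a tiny hand-made data frame (`toy`, with made-up rows) rather than the real file, so the column values are assumptions for illustration only.

```r
# Tiny hand-made stand-in with the same kind of columns (NOT the wine data)
toy <- data.frame(X = 1:4,
                  alcohol = c(9.4, 9.8, 10.0, 10.5),
                  quality = c(5L, 5L, 6L, 7L))

# Drop the row-index column
toy$X <- NULL

# Replace the integer score with a categorical version
toy$quality <- factor(toy$quality)

str(toy)
```

Keeping quality as a factor is convenient for boxplots and colour grouping, since each score becomes its own category.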
What is/are the main feature(s) of interest in your dataset? Based on the description of the dataset, I suspect volatile acidity, residual sugar, and chlorides, since these factors seem to directly affect the taste of wine. I'd like to determine which features are best for predicting the quality of a wine.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this? I applied a square-root transformation to the right-skewed total sulfur dioxide distribution. The transformed distribution looks closer to a normal distribution.
According to the correlation matrix, quality seems to correlate with volatile.acidity, density, sulphates, and alcohol. I want to take a closer look at scatter plots between them.
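A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below runs on simulated data (the `toy` data frame and its coefficients are assumptions, not the wine data) and shows one way to rank variables by the size of their correlation with quality.

```r
# Simulated stand-in (NOT the wine data): quality depends on two predictors
set.seed(7)
n <- 200
toy <- data.frame(alcohol = rnorm(n, 10, 1),
                  volatile.acidity = rnorm(n, 0.5, 0.15))
toy$quality <- 0.4 * toy$alcohol - 2.5 * toy$volatile.acidity +
  rnorm(n, sd = 0.6)

# Pearson correlation matrix of all numeric columns
cor_mat <- cor(toy)

# Correlation of every variable with quality, sorted by absolute size
quality_cor <- cor_mat[, "quality"]
quality_cor[order(abs(quality_cor), decreasing = TRUE)]
```

`cor()` requires numeric columns, so a factor version of quality would have to be converted back to numbers first.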
It looks like there is a negative correlation.
Standard deviation of density by quality
## data$quality: 3
## [1] 0.002001845
## --------------------------------------------------------
## data$quality: 4
## [1] 0.001575169
## --------------------------------------------------------
## data$quality: 5
## [1] 0.001588504
## --------------------------------------------------------
## data$quality: 6
## [1] 0.002000009
## --------------------------------------------------------
## data$quality: 7
## [1] 0.002175739
## --------------------------------------------------------
## data$quality: 8
## [1] 0.002378276
The ranges of the density boxplots largely overlap.
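The layout of the output above is consistent with base R's `by()`, although the exact call is not shown, so treat that as an assumption. A minimal sketch on simulated data (the `toy` data frame is hypothetical, not the wine data):

```r
# Simulated stand-in (NOT the wine data)
set.seed(3)
toy <- data.frame(quality = factor(rep(c(5, 6, 7), times = c(40, 40, 20))),
                  density = rnorm(100, mean = 0.997, sd = 0.002))

# Standard deviation of density within each quality level
sd_by_quality <- by(toy$density, toy$quality, sd)
sd_by_quality

# tapply() returns the same numbers as a plain named vector
tapply(toy$density, toy$quality, sd)
```

Swapping `sd` for `length` in the same call gives the number of observations per quality level.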
# Sulphates by quality: jittered points with boxplots drawn on top
library(ggplot2)
ggplot(data, aes(x = quality, y = sulphates)) +
  geom_jitter(alpha = 0.3) +
  geom_boxplot()
Standard deviation of sulphates by quality
## data$quality: 3
## [1] 0.12202
## --------------------------------------------------------
## data$quality: 4
## [1] 0.239391
## --------------------------------------------------------
## data$quality: 5
## [1] 0.1710623
## --------------------------------------------------------
## data$quality: 6
## [1] 0.1586495
## --------------------------------------------------------
## data$quality: 7
## [1] 0.1356389
## --------------------------------------------------------
## data$quality: 8
## [1] 0.1153795
The sulphates variable has many outliers in its boxplot.
Density and sulphates show a similar pattern in their plots. Both change across quality levels, but since their ranges are quite narrow, I am not sure the differences are statistically significant.
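One way to check whether such group differences are more than noise is a Kruskal-Wallis rank-sum test. This test is not part of the original analysis; it is offered here only as a sketch, run on simulated data (the `toy` data frame and its group means are assumptions, not the wine data).

```r
# Simulated stand-in (NOT the wine data): three quality groups whose
# sulphates means differ slightly
set.seed(11)
toy <- data.frame(
  quality   = factor(rep(c(5, 6, 7), each = 50)),
  sulphates = c(rnorm(50, 0.60, 0.15),
                rnorm(50, 0.65, 0.15),
                rnorm(50, 0.75, 0.15))
)

# Kruskal-Wallis rank-sum test: do the groups share one distribution?
kw <- kruskal.test(sulphates ~ quality, data = toy)
kw$p.value  # a small p-value suggests at least one group differs
```

A rank-based test is a reasonable choice here because it does not assume normality and is less sensitive to the outliers seen in the boxplots.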
## data$quality: 3
## [1] 10
## --------------------------------------------------------
## data$quality: 4
## [1] 53
## --------------------------------------------------------
## data$quality: 5
## [1] 681
## --------------------------------------------------------
## data$quality: 6
## [1] 638
## --------------------------------------------------------
## data$quality: 7
## [1] 199
## --------------------------------------------------------
## data$quality: 8
## [1] 18
Qualities 5 and 6 account for most of the alcohol observations. Even though quality 5 has some outliers, there appears to be a positive correlation between alcohol and quality.
The points are concentrated at low alcohol and relatively low volatile.acidity.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset? Alcohol and volatile.acidity correlate with quality.
On the other hand, density and sulphates take relatively similar values across quality levels.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)? Density and fixed.acidity have a correlation of 0.668. Alcohol and density have a correlation of -0.496.
What was the strongest relationship you found? Alcohol and volatile.acidity show a relatively strong correlation with the quality of wine.
Multivariate Plots Section
You can see the colour changing from the upper left to the lower right. In addition, the regression lines suggest that volatile.acidity is a more important predictor of wine quality than alcohol: many of the coloured regression lines are horizontal, which means the quality levels are separated mainly along volatile.acidity. From this graph, the lower the volatile.acidity, the better the quality.
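A plot of this kind can be sketched as below. The axis assignment (alcohol on x, volatile.acidity on y) is an assumption inferred from the description, and the data are simulated (`toy` and its three quality bands are hypothetical, not the wine data).

```r
library(ggplot2)

# Simulated stand-in (NOT the wine data)
set.seed(5)
n <- 300
toy <- data.frame(alcohol = runif(n, 8.5, 14),
                  volatile.acidity = runif(n, 0.2, 1.2))
score <- 0.3 * toy$alcohol - 1.4 * toy$volatile.acidity + rnorm(n, sd = 0.5)
toy$quality <- cut(score, breaks = 3, labels = c("low", "mid", "high"))

# Points coloured by quality, plus one straight-line fit per quality level
p <- ggplot(toy, aes(x = alcohol, y = volatile.acidity, colour = quality)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

A flat line in such a plot says that, within one quality level, the y variable barely changes with x. How strongly each variable relates to quality is probably easier to judge from the correlations or the regression coefficients.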
I wondered whether volatile.acidity is also correlated with density and sulphates, so I made the same kind of graph with each of those variables.
First, a scatter plot of volatile.acidity and density, coloured by quality.
Since many regression lines are horizontal, volatile.acidity has a stronger correlation with quality than density does.
I made the same scatter plot with sulphates.
Although the regression line for quality 3 is nearly vertical, the other lines are almost horizontal. Again, volatile.acidity appears to have a stronger correlation with quality than sulphates does.
I changed quality from a factor to an integer in order to run a multiple regression.
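Converting a factor back to numbers has a well-known pitfall in R, sketched here on a small hand-made vector (`quality_f` is a hypothetical example, not the wine data): `as.integer()` applied directly to a factor returns the internal level codes rather than the original scores.

```r
# Hand-made example (NOT the wine data): scores stored as a factor
quality_f <- factor(c(5, 5, 6, 3, 8))

# as.integer() on a factor returns the internal level codes, not the scores
codes <- as.integer(quality_f)                 # 2 2 3 1 4

# Going through character first recovers the original scores
scores <- as.integer(as.character(quality_f))  # 5 5 6 3 8
```

Using the character route keeps the regression intercept on the original quality scale.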
##
## Call:
## lm(formula = data$quality ~ data$volatile.acidity + data$alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59342 -0.40416 -0.07426 0.46539 2.25809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.09547 0.18450 16.78 <2e-16 ***
## data$volatile.acidity -1.38364 0.09527 -14.52 <2e-16 ***
## data$alcohol 0.31381 0.01601 19.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared: 0.317, Adjusted R-squared: 0.3161
## F-statistic: 370.4 on 2 and 1596 DF, p-value: < 2.2e-16
The p-values of both independent variables are very small (<2e-16). The adjusted R-squared is 0.3161, so about 32% of the variance in quality is explained by this model.
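The same kind of model can be written with the `data` argument of `lm()` instead of `data$column` terms, which keeps the coefficient names short. The sketch below uses simulated data (`toy` and its coefficients are assumptions, not the wine data). It also refits the model on z-scored columns, a step that is not in the original analysis, as one possible way to put the two slopes on a comparable scale.

```r
# Simulated stand-in (NOT the wine data)
set.seed(9)
n <- 500
toy <- data.frame(alcohol = rnorm(n, 10.4, 1.1),
                  volatile.acidity = rnorm(n, 0.53, 0.18))
toy$quality <- 3 + 0.3 * toy$alcohol - 1.4 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

# Multiple regression with the data argument instead of data$column terms
fit <- lm(quality ~ volatile.acidity + alcohol, data = toy)
summary(fit)$adj.r.squared

# Refit on z-scored columns so the two slopes are on a comparable scale
fit_std <- lm(quality ~ volatile.acidity + alcohol,
              data = as.data.frame(scale(toy)))
coef(fit_std)
```

Raw slopes depend on each predictor's units, so the standardized slopes (or the t values in the summary) are the more natural basis for comparing predictors.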
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
I compared volatile.acidity with three other features: alcohol, density, and sulphates. As a result, volatile.acidity has a stronger correlation with wine quality than each of the three.
Were there any interesting or surprising interactions between features?
Looking at the scatter plot alone did not tell me much, but adding regression lines made it much more meaningful. I learned how important it is to look at data from different points of view.
First, from the correlation matrix, I chose two variables, volatile.acidity and alcohol, since each appears to be correlated with quality in its scatterplot. I then took a closer look at those variables using coloured boxplots. These boxplots are Plot One and Plot Two.
Although both graphs show a clear trend, volatile.acidity looks like the better predictor of quality in these boxplots.
I therefore wanted a graph that includes all three variables at once, so I made a scatterplot coloured by quality and added a regression line for each colour.
Surprisingly, the regression lines make it clear that the correlation between volatile.acidity and quality is stronger than the one between alcohol and quality, because most regression lines are drawn horizontally.
I chose this plot simply because it reflects my idea that there is a correlation between volatile.acidity and quality. I also thought this coloured boxplot could easily convey how volatile.acidity affects the quality of wine: the smaller the volatile.acidity, the better the quality.
I chose this plot for almost the same reason as Plot One. Although the standard deviation is clearly larger in each boxplot, the quality of wine still tends to improve as its alcohol content increases.
## data$quality: 3
## [1] 10
## --------------------------------------------------------
## data$quality: 4
## [1] 53
## --------------------------------------------------------
## data$quality: 5
## [1] 681
## --------------------------------------------------------
## data$quality: 6
## [1] 638
## --------------------------------------------------------
## data$quality: 7
## [1] 199
## --------------------------------------------------------
## data$quality: 8
## [1] 18
After confirming the two correlations with quality in the plots above, I needed to take a closer look at their relationships in one graph. Therefore, I made a scatter plot coloured by quality, with regression lines. This graph clearly shows that volatile.acidity has a stronger correlation with quality than alcohol does.
The regression lines for quality 3 and quality 8 are also not horizontal compared with the other lines, but as you can see, those groups contain only 10 and 18 wines, respectively. With more samples for quality 3 and 8, the regression lines might change.
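The point about small groups can be illustrated by comparing how uncertain a fitted slope is for a small and a large sample. This is a sketch on simulated data (`make_group()`, `slope_ci_width()`, and the group sizes are hypothetical helpers and values, not taken from the wine data).

```r
# Simulated stand-in (NOT the wine data): one small and one large group
set.seed(21)
make_group <- function(n) {
  x <- runif(n, 8.5, 14)
  data.frame(x = x, y = 0.5 - 0.01 * x + rnorm(n, sd = 0.15))
}
small <- make_group(10)
large <- make_group(600)

# Width of the 95% confidence interval for the fitted slope
slope_ci_width <- function(d) {
  ci <- confint(lm(y ~ x, data = d))["x", ]
  unname(ci[2] - ci[1])
}

c(small = slope_ci_width(small), large = slope_ci_width(large))
```

The much wider interval for the small group means its line could plausibly tilt either way, which is why the slopes for sparsely populated quality levels deserve less weight.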
From my analysis, I confirmed that quality is correlated with some of the variables. Plot Three shows a clear pattern: volatile.acidity has a stronger correlation with quality than alcohol. In addition, the multiple regression based on Plot Three has an adjusted R-squared of 0.3161.
To be honest, before drawing regression lines in Plot Three, I did not think the analysis was going well, since the correlations between quality and the other variables looked really weak. However, adding regression lines to the scatter plot changed my mind, because the lines made the pattern obvious. Through this course, I learned to look at data from various points of view. In the future, I would like to learn more ways of examining data so that I do not miss important points.